{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "221f2f2d",
   "metadata": {},
   "source": [
    "## 聚合分组的应用场景"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "15f460f6",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:18:21.893216Z",
     "start_time": "2022-05-17T01:18:21.864048Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'聚合分组非常常见，我们的数据是扁平化的，没有任何分组信息。\\n比如你一周会多次去一家便利店，每次会产生一条购买记录，那么便利店想统计每个人这周的购买情况，就需要以人来进行分组，然后所有的金额相加，就得出每个人的购买额。'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"聚合分组非常常见，我们的数据是扁平化的，没有任何分组信息。\n",
    "比如你一周会多次去一家便利店，每次会产生一条购买记录，那么便利店想统计每个人这周的购买情况，就需要以人来进行分组，然后所有的金额相加，就得出每个人的购买额。\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9f44497",
   "metadata": {},
   "source": [
    "## SQL中的聚合分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "103821cf",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:19:13.897133Z",
     "start_time": "2022-05-17T01:19:13.887005Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nSELECT Column1, Column2, mean(Column3), sum(Column4)\\nFROM SomeTable\\nGROUP BY Column1, Column2\\n'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# SQL 这样做分组和聚合：\n",
    "\n",
    "\"\"\"\n",
    "SELECT Column1, Column2, mean(Column3), sum(Column4)\n",
    "FROM SomeTable\n",
    "GROUP BY Column1, Column2\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "073cddab",
   "metadata": {},
   "source": [
    "## Pandas中的聚合分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7c2aabdb",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:20:07.814485Z",
     "start_time": "2022-05-17T01:20:07.803773Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Pandas 也是按照 SQL 的思路，对数据进行非常灵活的分组，分组后会产生一个分组对象，接下来的操作就是就是对这个分组对象的操作。\\n\\n针对分组对象，Pandas 操作了多个方面，如 agg 聚合、apply 自定义函数、迭代等，方便大家实现多种多样的聚合需求。\\n\\n执行完分组统计计算后，再将数据拼成一个最终的数据返回给我们。'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"Pandas 也是按照 SQL 的思路，对数据进行非常灵活的分组，分组后会产生一个分组对象，接下来的操作就是就是对这个分组对象的操作。\n",
    "\n",
    "针对分组对象，Pandas 操作了多个方面，如 agg 聚合、apply 自定义函数、迭代等，方便大家实现多种多样的聚合需求。\n",
    "\n",
    "执行完分组统计计算后，再将数据拼成一个最终的数据返回给我们。\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9916538",
   "metadata": {},
   "source": [
    "## 按列分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6c5dcff0",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:24:00.061979Z",
     "start_time": "2022-05-17T01:24:00.051753Z"
    }
   },
   "outputs": [],
   "source": [
    "# df.groupby() 方法可以按指定字段对 DataFrame 进行分组，生成一个分组器对象，然后再把这个对象的各个字段按一定的聚合方法输出。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f44a3b0",
   "metadata": {},
   "source": [
    "### 语法结构"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a34849b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:24:37.801000Z",
     "start_time": "2022-05-17T01:24:37.780806Z"
    }
   },
   "outputs": [],
   "source": [
    "df.groupby(self, by=None, axis=0, level=None,\n",
    "           as_index: bool=True, sort: bool=True,\n",
    "           group_keys: bool=True,\n",
    "           squeeze: bool=False,\n",
    "           observed: bool=False, dropna=True)\n",
    "# 其中 by 为分组字段，由于是第一个参数可以省略，可以按列表给多个。\n",
    "# 会返回一个groupby_generic.DataFrameGroupBy对象，如果不给定聚合方法，不会返回 DataFrame。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "555c080f",
   "metadata": {},
   "source": [
    "### 基本用法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a125ee5c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:27:25.631779Z",
     "start_time": "2022-05-17T01:27:23.109653Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>team</th>\n",
       "      <th>Q1</th>\n",
       "      <th>Q2</th>\n",
       "      <th>Q3</th>\n",
       "      <th>Q4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Liver</td>\n",
       "      <td>E</td>\n",
       "      <td>89</td>\n",
       "      <td>21</td>\n",
       "      <td>24</td>\n",
       "      <td>64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Arry</td>\n",
       "      <td>C</td>\n",
       "      <td>36</td>\n",
       "      <td>37</td>\n",
       "      <td>37</td>\n",
       "      <td>57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Ack</td>\n",
       "      <td>A</td>\n",
       "      <td>57</td>\n",
       "      <td>60</td>\n",
       "      <td>18</td>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Eorge</td>\n",
       "      <td>C</td>\n",
       "      <td>93</td>\n",
       "      <td>96</td>\n",
       "      <td>71</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Oah</td>\n",
       "      <td>D</td>\n",
       "      <td>65</td>\n",
       "      <td>49</td>\n",
       "      <td>61</td>\n",
       "      <td>86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>Gabriel</td>\n",
       "      <td>C</td>\n",
       "      <td>48</td>\n",
       "      <td>59</td>\n",
       "      <td>87</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>Austin7</td>\n",
       "      <td>C</td>\n",
       "      <td>21</td>\n",
       "      <td>31</td>\n",
       "      <td>30</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>Lincoln4</td>\n",
       "      <td>C</td>\n",
       "      <td>98</td>\n",
       "      <td>93</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>Eli</td>\n",
       "      <td>E</td>\n",
       "      <td>11</td>\n",
       "      <td>74</td>\n",
       "      <td>58</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>Ben</td>\n",
       "      <td>E</td>\n",
       "      <td>21</td>\n",
       "      <td>43</td>\n",
       "      <td>41</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        name team  Q1  Q2  Q3  Q4\n",
       "0      Liver    E  89  21  24  64\n",
       "1       Arry    C  36  37  37  57\n",
       "2        Ack    A  57  60  18  84\n",
       "3      Eorge    C  93  96  71  78\n",
       "4        Oah    D  65  49  61  86\n",
       "..       ...  ...  ..  ..  ..  ..\n",
       "95   Gabriel    C  48  59  87  74\n",
       "96   Austin7    C  21  31  30  43\n",
       "97  Lincoln4    C  98  93   1  20\n",
       "98       Eli    E  11  74  58  91\n",
       "99       Ben    E  21  43  41  74\n",
       "\n",
       "[100 rows x 6 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_excel('team.xlsx', sheet_name='Sheet1')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f739fb6d",
   "metadata": {},
   "source": [
    "## 多层索引按单层分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "3cc2eb66",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:30:15.189301Z",
     "start_time": "2022-05-17T01:30:15.167325Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Q1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>team</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>1066</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>1056</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>963</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Q1\n",
       "team      \n",
       "A     1066\n",
       "B      975\n",
       "C     1056\n",
       "D      860\n",
       "E      963"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们可以实现类似 SQL groupby 那样的数据透视功能：\n",
    "\n",
    "# df.groupby('team').sum() # 按团队分组对应列相加\n",
    "# df.groupby('team').mean() # 按团队分组对应列求平均\n",
    "df.groupby('team').agg({'Q1': sum})  # 按团队分组,将Q1列相加\n",
    "# # 不同列不同的计算方法\n",
    "df.groupby('team').agg({'Q1': sum,  # 总和\n",
    "                        'Q2': 'count', # 总数\n",
    "                        'Q3':'mean', # 平均\n",
    "                        'Q4': max}) # 最大值\n",
    "\n",
    "# \"\"\"注：\n",
    "# 如果按一列聚合，只传列名字符串，如果多个就要传由列名组成的列表\n",
    "# 聚合方法可以使用 Pandas 的数学统计函数 或者 Numpy 的统计函数\n",
    "# 如果是 python 的内置统计函数，直接使用变量，不需要加引号\n",
    "# 如果需要将空值也进行聚合，需要传入 dropna=Flase\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e6f3c69",
   "metadata": {},
   "source": [
    "### Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "97476673",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:40:53.715338Z",
     "start_time": "2022-05-17T01:40:53.689684Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nfirst  second\\nbar    one       1.228388\\n       two       0.578858\\nbaz    one       1.398501\\n       two      -0.070547\\nfoo    one       0.551245\\n       two      -0.604068\\nqux    one       0.643383\\n       two      -0.025163\\ndtype: float64\\n'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 创建数据\n",
    "arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],\n",
    "          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]\n",
    "index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])\n",
    "s = pd.Series(np.random.randn(8), index=index)\n",
    "s\n",
    "'''\n",
    "first  second\n",
    "bar    one       1.228388\n",
    "       two       0.578858\n",
    "baz    one       1.398501\n",
    "       two      -0.070547\n",
    "foo    one       0.551245\n",
    "       two      -0.604068\n",
    "qux    one       0.643383\n",
    "       two      -0.025163\n",
    "dtype: float64\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fd9f09a",
   "metadata": {},
   "source": [
    "#### 按层级"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "febc7de2",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:41:13.663345Z",
     "start_time": "2022-05-17T01:41:13.642560Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nfirst\\nbar    1.807246\\nbaz    1.327954\\nfoo   -0.052823\\nqux    0.618220\\ndtype: float64\\n'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按层级进行分组：\n",
    "\n",
    "s.groupby(level=0).sum()\n",
    "'''\n",
    "first\n",
    "bar    1.807246\n",
    "baz    1.327954\n",
    "foo   -0.052823\n",
    "qux    0.618220\n",
    "dtype: float64\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ff97e76",
   "metadata": {},
   "source": [
    "#### 按索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0c477fc3",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:41:31.341537Z",
     "start_time": "2022-05-17T01:41:31.324021Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_10684/3920823520.py:4: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().\n",
      "  s.sum(level='second') # 也可以直接统计\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'\\nfirst\\nbar    1.807246\\nbaz    1.327954\\nfoo   -0.052823\\nqux    0.618220\\ndtype: float64\\n'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按索引名分组：\n",
    "\n",
    "s.groupby(level='second').sum()\n",
    "s.sum(level='second') # 也可以直接统计\n",
    "'''\n",
    "first\n",
    "bar    1.807246\n",
    "baz    1.327954\n",
    "foo   -0.052823\n",
    "qux    0.618220\n",
    "dtype: float64\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4cb5f24",
   "metadata": {},
   "source": [
    "### DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "aec1242d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:43:34.186861Z",
     "start_time": "2022-05-17T01:43:34.169037Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Max Speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Animal</th>\n",
       "      <th>Type</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Falcon</th>\n",
       "      <th>Captive</th>\n",
       "      <td>390.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wild</th>\n",
       "      <td>350.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Parrot</th>\n",
       "      <th>Captive</th>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wild</th>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Max Speed\n",
       "Animal Type              \n",
       "Falcon Captive      390.0\n",
       "       Wild         350.0\n",
       "Parrot Captive       30.0\n",
       "       Wild          20.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 数据准备\n",
    "arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],\n",
    "          ['Captive', 'Wild', 'Captive', 'Wild']]\n",
    "index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))\n",
    "df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},\n",
    "                  index=index)\n",
    "df\n",
    "# '''\n",
    "#                 Max Speed\n",
    "# Animal Type\n",
    "# Falcon Captive      390.0\n",
    "#        Wild         350.0\n",
    "# Parrot Captive       30.0\n",
    "#        Wild          20.0\n",
    "# '''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0b67c6f",
   "metadata": {},
   "source": [
    "#### 按层级"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "9fd56998",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T01:44:02.624884Z",
     "start_time": "2022-05-17T01:44:02.596227Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nfirst\\nbar    1.807246\\nbaz    1.327954\\nfoo   -0.052823\\nqux    0.618220\\ndtype: float64\\n'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按层级进行分组：\n",
    "\n",
    "df.groupby(level=0).mean()\n",
    "'''\n",
    "        Max Speed\n",
    "Animal\n",
    "Falcon      370.0\n",
    "Parrot       25.0\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c97bde85",
   "metadata": {},
   "source": [
    "#### 按索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df83d6c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 按索引名分组：\n",
    "\n",
    "df.groupby(level=\"Type\").mean()\n",
    "'''\n",
    "         Max Speed\n",
    "Type\n",
    "Captive      210.0\n",
    "Wild         185.0\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdfea41d",
   "metadata": {},
   "source": [
    "## 多层索引按多层分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "5876691a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T02:33:54.414124Z",
     "start_time": "2022-05-17T02:33:54.391030Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nfirst  second  third\\nbar    doo     one     -1.131345\\n               two     -0.089329\\nbaz    bee     one      0.337863\\n               two     -0.945867\\nfoo    bop     one     -0.932132\\n               two      1.956030\\nqux    bop     one      0.017587\\n               two     -0.016692\\ndtype: float64\\n'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 数据准备\n",
    "arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],\n",
    "          ['doo', 'doo', 'bee', 'bee', 'bop', 'bop', 'bop', 'bop'],\n",
    "          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]\n",
    "index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second', 'third'])\n",
    "s = pd.Series(np.random.randn(8), index=index)\n",
    "\n",
    "# 数据如下\n",
    "'''\n",
    "first  second  third\n",
    "bar    doo     one     -1.131345\n",
    "               two     -0.089329\n",
    "baz    bee     one      0.337863\n",
    "               two     -0.945867\n",
    "foo    bop     one     -0.932132\n",
    "               two      1.956030\n",
    "qux    bop     one      0.017587\n",
    "               two     -0.016692\n",
    "dtype: float64\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00e70ce8",
   "metadata": {},
   "source": [
    "### 两个层级进行分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5a75e1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 两个层级进行分组：\n",
    "\n",
    "s.groupby(level=['first', 'second']).sum()\n",
    "'''\n",
    "first  second\n",
    "bar    doo      -1.220674\n",
    "baz    bee      -0.608004\n",
    "foo    bop       1.023898\n",
    "qux    bop       0.000895\n",
    "dtype: float64\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f47c1ce8",
   "metadata": {},
   "source": [
    "### 按层名进行分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "3b60769c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-05-17T02:37:02.753559Z",
     "start_time": "2022-05-17T02:37:02.736424Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "first  second\n",
       "bar    doo       1.427401\n",
       "baz    bee      -0.942108\n",
       "foo    bop       0.934407\n",
       "qux    bop      -1.521312\n",
       "dtype: float64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按层名进行分组，注意，这个和上边使用了两个不同的参数。\n",
    "s.groupby( ['first', 'second']).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d16955a4",
   "metadata": {},
   "source": [
    "## 使用分组器 Grouper分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60f4ffe2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 数据准备\n",
    "arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],\n",
    "          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]\n",
    "index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])\n",
    "df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],\n",
    "                   'B': np.arange(8)},\n",
    "                  index=index)\n",
    "df\n",
    "'''\n",
    "              A  B\n",
    "first second\n",
    "bar   one     1  0\n",
    "      two     1  1\n",
    "baz   one     1  2\n",
    "      two     1  3\n",
    "foo   one     2  4\n",
    "      two     2  5\n",
    "qux   one     3  6\n",
    "      two     3  7\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a5aeb87",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 使用分组器对第一层索引（second）和 A 列进行分组：\n",
    "\n",
    "# 三个方式效果一样\n",
    "df.groupby([pd.Grouper(level=1), 'A']).sum()\n",
    "df.groupby([pd.Grouper(level='second'), 'A']).sum()\n",
    "df.groupby(['second', 'A']).sum()\n",
    "'''\n",
    "          B\n",
    "second A\n",
    "one    1  2\n",
    "       2  4\n",
    "       3  6\n",
    "two    1  4\n",
    "       2  5\n",
    "       3  7\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "046d12a7",
   "metadata": {},
   "source": [
    "## 数据分箱"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4267278c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 数据分箱（Data binning，也称为离散组合或数据分桶）是一种数据预处理技术，\n",
    "# 将原始数据分成小区间，即一个 bin（小箱子），它是一种量子化的形式。\n",
    "\n",
    "# Pandas 实现连续数据的离散化处理主要基于两个函数：\n",
    "# pandas.cut 根据指定分界点对连续数据进行分箱处理\n",
    "# pandas.qcut 根据指定箱子的数量对连续数据进行等宽分箱处理"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d44db974",
   "metadata": {},
   "source": [
    "### pd.cut"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9d1fba3",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"pd.cut函数有7个参数，主要用于对数据从最大值到最小值进行等距划分\n",
    "\n",
    " pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False,  duplicates=‘raise’)\n",
    "\n",
    "参数：\n",
    "\n",
    "x : 输入待cut的一维数组\n",
    "\n",
    "bins : cut的段数，一般为整型，但也可以为标量序列或者间隔索引，是进行分组的依据，\n",
    "如果填入整数n，则表示将x中的数值分成等宽的n份（即每一组内的最大值与最小值之差约相等）；\n",
    "如果是标量序列，序列中的数值表示用来分档的分界值\n",
    "如果是间隔索引，“ bins”的间隔索引必须不重叠\n",
    "\n",
    "right : 布尔值，确定右区间是否开闭，取True时右区间闭合\n",
    "当“ right = True”（默认值）时，则“ bins”=[1、2、3、4]表示（1,2]，（2,3],（3,4]\n",
    "当bins是一个间隔索引时，该参数被忽略。\n",
    "\n",
    "labels : 指定分箱的标签, 数组或布尔值，默认为None，用来标识分后的bins，长度必须与结果bins相等，返回值为整数或者对bins的标识\n",
    "如果是数组，长度要与分箱个数一致，比如“ bins”=[1、2、3、4]表示（1,2]，（2,3],（3,4]一共3个区间，则labels的长度也就是标签的个数也要是3\n",
    "如果为False，则仅返回分箱的整数指示符，即x中的数据在第几个箱子里\n",
    "当bins是间隔索引时，将忽略此参数\n",
    "\n",
    "retbins : 布尔值，可选。是否返回数值所在分组，Ture则返回\n",
    "retbins=True以显示分界值，得到划分后的区间\n",
    "\n",
    "precision : 整型，bins小数精度，也就是数据以几位小数显示,默认3，存储和显示分箱标签的精度\n",
    "\n",
    "include_lowest : 布尔类型，是否包含左区间.表示区间的左边是开还是闭，默认为false，也就是不包含区间左边。\n",
    "\n",
    "duplicates：如果分箱临界值不唯一，则引发ValueError或丢弃非唯一\"\"\" "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "400c2826",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     (70, 100]\n",
       "1       (0, 60]\n",
       "2       (0, 60]\n",
       "3     (70, 100]\n",
       "4      (60, 70]\n",
       "        ...    \n",
       "95      (0, 60]\n",
       "96      (0, 60]\n",
       "97    (70, 100]\n",
       "98      (0, 60]\n",
       "99      (0, 60]\n",
       "Name: Q1, Length: 100, dtype: category\n",
       "Categories (3, interval[int64, right]): [(0, 60] < (60, 70] < (70, 100]]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# pd.cut\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_excel('team.xlsx', sheet_name='Sheet1')\n",
    "df\n",
    "\n",
    "# 指定区间将数字进行划分，以下三个值将数据划分成两个区间（及格或者不及格）：\n",
    "pd.cut(df.Q1, bins=[0, 60, 100])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35c36fea",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# 不用区间，使用数字做为标签（0，1，2，n）\n",
    "pd.cut(df.Q1, bins=[0, 60, 100],labels=False)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "facdd86d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      及格\n",
       "1     不及格\n",
       "2     不及格\n",
       "3      及格\n",
       "4      及格\n",
       "     ... \n",
       "95    不及格\n",
       "96    不及格\n",
       "97     及格\n",
       "98    不及格\n",
       "99    不及格\n",
       "Name: Q1, Length: 100, dtype: category\n",
       "Categories (2, object): ['不及格' < '及格']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 指定标签名\n",
    "pd.cut(df.Q1, bins=[0, 60, 100],labels=['不及格','及格',])\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4c2fabc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 包含最低部分\n",
    "pd.cut(df.Q1, bins=[0, 60, 100], include_lowest=True)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50347681",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 是否包含右边，闭区间，下例 [89, 100)\n",
    "pd.cut(df.Q1, bins=[0, 89, 100], right=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3d04c66",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 应用到分组中：\n",
    "\n",
    "df.Q1.groupby(pd.cut(df.Q1, bins=[0, 60, 100])).count()\n",
    "\n",
    "df.groupby(pd.cut(df.Q1, bins=[0, 60, 100])).count()\n",
    "\n",
    "\n",
    "# 按区间做映射的例子：\n",
    "\n",
    "df = pd.DataFrame({'A':[1, 3, 5, 7, 9]})\n",
    "df.assign(B=pd.cut(df.A, [0,5,7, float('inf')], \n",
    "                   labels=['差','中','好'])\n",
    "         )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b1649e2",
   "metadata": {},
   "source": [
    "### pd.qcut"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "548fc26e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'试想一下如果我们有一个很大的数据集，需要对其中一项进行分箱，\\n分箱的依据不是单纯的等宽箱体或者没有确定的分解值，而是按照分位数进行分箱，\\n比如前四分之一的是一个箱体这种要求，\\n用pd.cut()不是不能实现，只是比较麻烦，还要先计算分位数作为分解值。\\n这个时候，pd.qcut()就方便很多了。'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# pd.qcut 指定所分箱子的数量，pandas 会自动进行分箱：\n",
    "\n",
    "\"\"\"试想一下如果我们有一个很大的数据集，需要对其中一项进行分箱，\n",
    "分箱的依据不是单纯的等宽箱体或者没有确定的分解值，而是按照分位数进行分箱，\n",
    "比如前四分之一的是一个箱体这种要求，\n",
    "用pd.cut()不是不能实现，只是比较麻烦，还要先计算分位数作为分解值。\n",
    "这个时候，pd.qcut()就方便很多了。\"\"\"\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7588315",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates=‘raise’)\n",
    "\n",
    "\"\"\"x ：一维数组或者Serise\n",
    "\n",
    "q ： 表示分位数的整数或者数组，\n",
    "\n",
    "如果是分位数的整数，例如10用于十分位，4用于四分位\n",
    "如果是分位数数组，例如[0,0.25,0.5,0.75,1]用于四分位数\n",
    "labels ： 数组或者布尔值，默认为none，用于指定每个箱体的标签\n",
    "\n",
    "如果是数组，长度要与分箱个数一致，比如用四分位数分箱，需要指定四个标签\n",
    "如果为False，则仅返回分箱的整数指示符，即当前数据位于哪个箱子中\n",
    "rebines ：布尔值，可选。 是否显示分箱的分界值。（由于是按照分位数进行分箱，在不知道分位数具体数值的情况下，可以通过这个参数设置显示分界值即分位数的具体数值）\n",
    "\n",
    "precision：整数，默认3，存储和显示分箱标签的精度。\n",
    "\n",
    "duplicates：如果分箱临界值不唯一，则引发ValueError或丢弃非唯一\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f62b8498",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.qcut(df.Q1,q=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c1d86a09",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]\n",
       "Categories (4, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.qcut(range(5), 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7e63c626",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 1, 2, 3], dtype=int64)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.qcut(range(5), 4, labels=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b6571a4b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['good', 'good', 'medium', 'bad', 'bad']\n",
       "Categories (3, object): ['good' < 'medium' < 'bad']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 指定标签名\n",
    "pd.qcut(range(5), 3, labels=[\"good\", \"medium\", \"bad\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e5573889",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0      (51.5, 98.0]\n",
       " 1     (0.999, 51.5]\n",
       " 2      (51.5, 98.0]\n",
       " 3      (51.5, 98.0]\n",
       " 4      (51.5, 98.0]\n",
       "           ...      \n",
       " 95    (0.999, 51.5]\n",
       " 96    (0.999, 51.5]\n",
       " 97     (51.5, 98.0]\n",
       " 98    (0.999, 51.5]\n",
       " 99    (0.999, 51.5]\n",
       " Name: Q1, Length: 100, dtype: category\n",
       " Categories (2, interval[float64, right]): [(0.999, 51.5] < (51.5, 98.0]],\n",
       " array([ 1. , 51.5, 98. ]))"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 返回箱子标签 array([ 1. , 51.5, 98. ]))\n",
    "pd.qcut(df.Q1,q=2, retbins=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "35d186ea",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      (51.5, 98.0]\n",
       "1     (0.999, 51.5]\n",
       "2      (51.5, 98.0]\n",
       "3      (51.5, 98.0]\n",
       "4      (51.5, 98.0]\n",
       "          ...      \n",
       "95    (0.999, 51.5]\n",
       "96    (0.999, 51.5]\n",
       "97     (51.5, 98.0]\n",
       "98    (0.999, 51.5]\n",
       "99    (0.999, 51.5]\n",
       "Name: Q1, Length: 100, dtype: category\n",
       "Categories (2, interval[float64, right]): [(0.999, 51.5] < (51.5, 98.0]]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 分箱位小数位数\n",
    "pd.qcut(df.Q1,q=2,precision=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "28cd9522",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     (67.0, 100.0]\n",
       "1      (34.0, 67.0]\n",
       "2      (34.0, 67.0]\n",
       "3     (67.0, 100.0]\n",
       "4      (34.0, 67.0]\n",
       "          ...      \n",
       "95     (34.0, 67.0]\n",
       "96    (0.999, 34.0]\n",
       "97    (67.0, 100.0]\n",
       "98    (0.999, 34.0]\n",
       "99    (0.999, 34.0]\n",
       "Name: Q1, Length: 100, dtype: category\n",
       "Categories (3, interval[float64, right]): [(0.999, 34.0] < (34.0, 67.0] < (67.0, 100.0]]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 排名分3个层次\n",
    "pd.qcut(df.Q1.rank(method='first'),3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a021e943",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "255.783px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
